Abstract

We propose a comprehensive metrics validation methodology that has
six validation criteria, each of which supports certain quality
functions. New criteria are defined and illustrated, including
consistency, discriminative power, tracking and repeatability. We show
that non-parametric statistical methods play an important role in
evaluating metrics against the validity criteria. A detailed example
of the application of the methodology is presented.

keywords:  metrics  validation  methodology,  validity  criteria,  quality 
functions,  non-parametric  statistical  methods. 

INTRODUCTION 

If the software engineering community believes that the field of
metrics should be engineering and not art, then it should subscribe to
the idea that we evaluate (validate) whether metrics measure what they
purport to measure prior to using the metrics. Furthermore, if metrics
are to be of greatest utility, the validation should be performed in
terms of the quality functions (quality assessment, control and
prediction) that the metrics are to support.

Our purpose is to propose and illustrate a validation methodology
whose adoption, we believe, would provide a rational basis for using
metrics. We believe this to be the most comprehensive metrics
methodology ever proposed. There have been useful validation analyses
performed on specific metrics or metric systems for the purpose of
satisfying specific research goals. Among these validations are the
following: 1) function points as a predictor of work hours across
different development sites and sets of data [2]; 2) reliability of
metrics data reported by programmers [3]; 3) Halstead operator count
for Pascal programs [7]; 4) metric-based classification trees [18]; 5)
evaluation of metrics against syntactic complexity properties [19].
Our approach to validation differs in the following ways: 1) The
methodology is general and not specific to particular metrics or
research objectives. 2) It is developed from the point of view of the
metric user (rather than the researcher), who has requirements for
assessing, controlling and predicting quality. To illustrate the
difference in viewpoint, we can make an analogy with the automobile
industry: the manufacturer has an interest in brake lining thickness,
as it relates to stopping distance, but from the driver's perspective,
the only meaningful metric is stopping distance! 3) It consists of six
mathematically defined criteria, each of which is keyed to a metrics
function, so the user of metrics can understand how a characteristic
of a metric, as revealed by validation tests, can be applied to
measure software quality. 4) It includes new criteria: consistency,
discriminative power, tracking and repeatability. 5) It recognizes
that a given metric can have multiple uses (e.g., assess, control and
predict quality) and that a given metric can be valid for one use and
invalid for another use. 6) It includes some useful statistical
methods, rarely seen in the metrics literature, that are applied to
metrics validation: partial linear correlation analysis, chi-square
test for differences in probabilities (contingency tables),
discriminant analysis and runs test.


It  is  not  our  purpose  to  be  a  proponent  or  an  opponent  of  given 
metrics.  Whether  certain  metrics  pass  or  fail  our  validity  tests  in 
the  examples  is  not  the  point  of  this  paper.  The  examples  are  for  the 
sole  purpose  of  illustrating  the  application  of  the  validation 
methodology.  The  validation  results  could  be  different  in  other 
applications  and  environments. 

We  emphasize  the  use  of  non-parametric  statistical  techniques  for 
metrics  validation  because:  1)  their  application  is  more  consistent 
with  the  nature  of  metrics  data  (e.g.,  non-linearity,  non-normality, 
large  variability)  than  are  parametric  techniques  and  2)  the  measures 
that  result  from  their  application  are  useful  for  quality  assessment 
and  control. 

Outline  of  Paper 

The  following  subjects  are  covered: 

o  Definitions. 

o  Rationale  of  Metrics  Validation. 

o  Quality  Functions. 

o  Non-parametric  Statistical  Methods  for  Metrics  Validation. 

o  Purpose  of  Metrics  Validation. 

o  Validity  Criteria. 

o  Example  of  Metrics  Validation. 

o  Summary  and  Future  Research. 

DEFINITIONS 


Critical Value        Metric value of a validated metric which is
                      used to identify software which has
                      unacceptable quality [11].

Quality Assessment    Evaluation of the relative quality of software
                      components.

Quality Attribute     A feature or characteristic that affects an
                      item's quality [13].

Quality Control       A set of activities designed to evaluate the
                      quality of developed components [modification
                      of 13].

Quality Factor        An attribute of software that contributes to
                      its quality [11]. A quality factor is also a
                      metric.

Quality Metric        A function whose inputs are software data and
                      whose output is a single (numerical) value that
                      can be interpreted as the degree to which
                      software possesses a given attribute that
                      affects its quality [13].

Quality Prediction    A forecast of component quality.

Quality Requirement   A requirement that a software attribute be
                      present in software to satisfy a contract,
                      standard, specification, or other formally
                      imposed document [11].

Software Component    General term used to refer to an element of a
                      software system, such as module, unit, data or
                      document [11].

Software Quality      The degree to which software possesses a
                      desired combination of attributes [12].

Validated Metric      A metric whose values have been statistically
                      associated with corresponding quality factor
                      values [11].

For simplicity of expression, terms will be used without the
qualifying word ('metric' instead of 'quality metric') in the
remainder of the paper except in the case of 'quality factor' which
will be used to distinguish it from 'factor' of the statistical method
'factor analysis'.

RATIONALE  FOR  METRICS  VALIDATION 

To help ensure that metrics are used appropriately, only validated
metrics (i.e., either quality factors or metrics validated with
respect to quality factors) should be used. Quality factors are valid
by definition. Furthermore, the metrics which are used should be those
which are associated with the quality requirements of the software
project. Both product and process metrics are used to assess software
quality. Our statements about product elements (i.e., components)
apply equally to the processes which produce the products.

It  should  be  understood  that  if  a  metric  is  validated  according  to 
our  criteria,  there  is  no  guarantee  that  it  will  faithfully  represent 
a  quality  factor  when  applied.  Validation  is  a  statistical  concept.  As 
such,  validation  can  only  be  performed  within  statistical  error 
limits.  The  major  benefit  of  validation  is  that  it  increases  the 
probability  that  the  metric  will  be  a  good  indicator  of  quality. 

QUALITY  FUNCTIONS 

Metrics  are  applied  in  three  major  quality  functions:  Quality 
Assessment,  Quality  Control  and  Quality  Prediction.  If  metrics  are  to 
aid  in  making  decisions  about  software  quality,  the  user  of  metrics 
must  understand  how  this  tool  supports  major  quality  functions  in  a 
software  engineering  organization.  Since  metrics  should  not  be 
validated  unless  the  applications  of  metrics  are  clearly  understood, 
it  is  worthwhile  to  describe  the  role  of  metrics  during  various 
software  phases  and  the  need  to  validate  the  metrics  for  specific 
metrics functions (i.e., the relationship must be made between
quality functions and validity criteria). Otherwise, a correlation
coefficient of .9 between metric X and quality factor Y, for example,
is only an abstraction. It only has meaning if validated in the
context of quality functions. These purposes are best served by
introducing validity criteria on a qualitative basis now; later,
mathematical definitions will be provided in the validation section.

QUALITY  ASSESSMENT 

Associativity

Software managers need a rational basis for allocating personnel
and computer resources to inspection, testing, and other quality
activities. A method for doing this is to use metrics to provide a
measure of relative quality across components. For example, the
magnitudes of a metric are used to establish priority of testing and
allocation of budget and effort to testing (i.e., the 'worst'
component would receive the most attention, largest budget and most
staff). One way to assess relative quality is as follows:

If  the  elements  of  a  metric  vector  M,  corresponding  to  components 
1,2,  ...,n,  are  ordered  by  magnitude,  as  shown  below,  does  this  imply 
an  ordering  of  component  quality? 

Magnitude[M1 > M2 > ... > Mn] ==> Monotonically Increasing (Decreasing)
Quality?

The validity criterion which assesses the degree to which this
relationship is satisfied is called associativity. A metric that is
validated according to this criterion is used to compare magnitudes of
a metric obtained from different components to estimate the degree to
which they differ in quality (e.g., 'the quality of Component 2 is
twice that of Component 1').

Consistency 

It may be that the software manager is only interested in whether
'Component 2 is better than Component 1' rather than how much better.
This  approach  has  the  advantage  of  not  requiring  a  linear  relationship 
between  quality  factors  and  metrics  in  order  to  have  perfect 
association  (e.g.,  if  a  factor  varies  as  the  cube  of  a  metric,  there 
is  still  perfect  association).  Thus,  rank  is  the  basis  of  comparison. 
Therefore,  a  second  way  to  assess  relative  quality  is  as  follows: 

If  the  elements  of  a  metric  vector  M,  corresponding  to  components 
1,2,  ...,n,  are  ordered  by  rank,  as  shown  below,  does  this  imply  an 
ordering  of  component  quality? 

Rank[M1 > M2 > ... > Mn] ==> Monotonically Increasing (Decreasing)
Quality?

The  validity  criterion  which  assesses  the  degree  to  which  this 
relationship  is  satisfied  is  called  consistency.  A  metric  that  is 
validated  according  to  this  criterion  is  used  to  compare  ranks  of  a 
metric  obtained  from  different  components  to  order  the  quality  of  a 
set  of  components. 


QUALITY CONTROL

Discriminative Power

Metrics are used to monitor the condition of a component to
determine whether the component appears to be out of tolerance. This
is defined to be a component whose quality is below standard. This
implies that critical values of metrics must be established prior to
the monitoring activity for comparing against the measured values
derived from the component.

In order to control quality during the design phase, components are
identified which appear to have unacceptable quality. Unacceptable
quality may be manifested as excessive complexity, inadequate
documentation, lack of traceability, or other undesirable attributes.
The existence of such conditions is an indication that the software
may not satisfy quality requirements when it becomes operational.
Since many of the quality factors which are usually of interest (e.g.,
reliability), cannot be measured during design, and are only available
during test and operation, validated metrics are used when quality
factors are not available. Validated metric measurements are compared
with the critical values of the metrics. Components whose measurements
are greater than (or less than) the critical values are flagged for
detailed inspection. Depending on the results of the inspection,
components are redesigned, scrapped, or not changed. The fact that a
measurement is outside the critical value does not necessarily mean
that the component will exhibit unacceptable quality during operation;
rather, it is a warning that the condition bears investigation. This
concept is illustrated in Figure 1 for metric vector M for components
1,2,...,n. The role of metrics validation for this use of quality
control is to identify a critical value of a metric, where that metric
has been validated against a quality factor on a previous similar
project. Then the metric can serve as a substitute to identify
unacceptable quality during design. Such a metric satisfies the
discriminative power validity criterion.


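[Figure 1 omitted: metric values M1, ..., Mn of components plotted
against project time during the design phase, with the critical value
of the metric separating the unacceptable region from the acceptable
region.]

Figure 1. Application of Metrics to Quality Control (discriminative
power)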


Tracking 

In  addition  to  component  quality  lying  within  acceptable  bounds,  a 
desirable  condition  is  for  quality  to  improve  over  the  life  of  the 
component  (i.e.,  a  component  should  exhibit  quality  growth).  Thus, 
during all phases of the life of the component we wish to track
quality in order to control quality. That is, we want to know whether
the software is getting better, worse, or staying the same. Again, in
most phases, the quality factor will not be available but we must know
how quality might be changing, nevertheless. This concept is
illustrated in Figure 2 for metric vector M for a given component i,
measured at times T1, T2,...,Tn. In this illustration, quality
increases from T1 to T2, stays the same from T2 to T3, and decreases
from T3 to Tn, assuming high metric values are 'bad'. Here, the
question  for  metrics  validation  is  whether  a  metric  can  be  identified 
whose  changes  over  time  will  track  changes  in  quality.  In  particular, 
if  a  metric  has  been  validated  as  tracking  a  quality  factor  on  a 
previous  similar  project,  it  would  serve  as  a  substitute  for  tracking 
quality  on  the  given  project.  Such  a  metric  satisfies  the  tracking 
validity  criterion. 


[Figure 2 omitted: metric values Mi of a given component i plotted at
times T1, T2, T3, T4, T5, ..., Tn over project time.]

Figure 2. Application of Metrics to Quality Control (tracking)

QUALITY  PREDICTION 

Predictability 

During  the  design  phase  validated  metrics  are  used  to  make 
predictions  of  test  or  operational  phase  quality  factors.  Predicted 
values  of  quality  factors  are  compared  with  target  values.  Components 
whose  predicted  quality  factor  values  are  greater  than  (or  less  than) 
the  target  values  are  flagged  for  detailed  inspection.  Potentially, 
prediction  is  more  valuable  than  assessment  and  control  because  it 
estimates  the  attribute  of  ultimate  interest  --  the  quality  factor. 
However,  prediction  is  more  difficult  because  it  involves  using 
validated  metrics  from  an  early  phase  (e.g.,  design)  to  make 
predictions  about  a  different  but  related  attribute  (quality  factor) 
in a much later phase (e.g., operations). This concept is illustrated
in Figure 3 where, at time T1, metric M is used to predict the factor
Fp at time T2, for a given component, and Fa is eventually observed as
the actual value at T2. The challenge to metrics validation is to find
a metric or metrics that can predict a quality factor with acceptable
accuracy. Such a metric satisfies the predictability validity
criterion.


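[Figure 3 omitted: metric M, measured at time T1 during design, is
used to predict the factor value Fp for time T2 during operations,
where the actual value Fa is later observed.]

Figure 3. Application of Metrics to Quality Prediction (predictability)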

NON-PARAMETRIC  STATISTICAL  METHODS  FOR  METRICS  VALIDATION 

Among  the  advantages  of  non-parametric  statistical  methods  over 
parametric  methods  [5,6,8]  which  are  important  for  metrics  validation, 
are  the  following: 

o Assumptions less restrictive than with parametric methods. Given the
noisiness of metrics data, this is a big plus.

o No assumption about distribution (e.g., data does not have to be
normally distributed).

o Can use ordinal scale (i.e., component A is higher quality than
component B).

o Can use nominal scale (i.e., A is high quality; B is low quality).

o Do not need interval scale (i.e., difference between A quality and B
quality).

o Do not need ratio scale (i.e., A is 2.5 times the quality of B).

For example, ranks of random variables [3] can be used rather than
the values themselves, thus relaxing the assumptions about data
relationships (e.g., linearity) while providing a measure of quality
(e.g., ranking of components) that is useful to the software manager.
In other words, the fact that the data is not as 'well-behaved' as we
might believe it should be does not necessarily mean that it is less
useful. In fact, when we consider that many useful applications of
metrics can be derived from the ability to classify components as
being 'better' or 'worse', 'high quality' or 'low quality', acceptable
or unacceptable, we realize that the information provided by
non-parametric analysis is supportive of this approach.
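
The point about ranks relaxing the linearity assumption can be
illustrated with a small sketch. The data are hypothetical (a quality
factor that varies as the cube of a metric), and the helper names
pearson and ranks are our own, not part of the methodology. The linear
correlation is about 0.94, while the correlation of the ranks is
exactly 1.

```python
import math

def pearson(x, y):
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

# Hypothetical data: the factor varies as the cube of the metric.
metric = [1, 2, 3, 4, 5, 6]
factor = [m ** 3 for m in metric]

linear_r = pearson(metric, factor)               # below 1: relation is not linear
rank_r = pearson(ranks(metric), ranks(factor))   # 1: the ordering is preserved
```

Correlating the ranks in this way is the Spearman rank correlation that
is used later for the consistency criterion.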

Multivariate  statistical  methods  (e.g.,  correlation  analysis, 
factor  analysis)  are  also  used  where  appropriate. 



PURPOSE  OF  METRICS  VALIDATION 

The  purpose  of  metrics  validation  is  to  identify  metrics  that  are 
related  to  quality  factors.  If  metrics  are  to  be  useful,  they  must 
indicate  accurately  whether  quality  requirements  have  been  achieved  or 
are  likely  to  be  achieved  in  the  future.  When  it  is  possible  to 
measure  quality  factors  at  the  desired  point  in  the  life  of  the 
software,  they  are  used  to  evaluate  software  quality.  At  other  points, 
certain  quality  factors  (e.g.,  reliability)  are  not  available;  they 
are  obtained  after  delivery  or  late  in  the  project.  In  these  cases, 
metrics  are  used  early  in  a  project  to  assess,  control  and  predict 
quality. 

It  is  important  that  metrics  be  validated  before  they  are  used  to 
evaluate  software  quality.  Otherwise,  metrics  may  be  misapplied  (i.e., 
metrics  may  be  used  that  have  little  or  no  relationship  to  the  desired 
quality  characteristics). 

VALIDITY  CRITERIA 

To  be  considered  valid,  a  metric  must  demonstrate  a  high  degree  of 
association  with  the  quality  factor  it  represents.  A  metric  may  be 
valid  with  respect  to  certain  validity  criteria  and  invalid  with 
respect  to  other  criteria. 

The  validation  procedure  requires  that  threshold  values  of  validity 
criteria be selected. These are the values 'V', 'B', 'A', and 'P'
which  are  described  below.  The  criterion  used  for  selecting  these 
values  is  reasonableness  (i.e.,  judgement  must  be  exercised  in 
selecting  values  to  strike  a  balance  between  the  one  extreme  of 
causing  a  metric  which  has  a  high  degree  of  association  with  a  quality 
factor  to  fail  validation  and  the  other  extreme  of  allowing  a  metric 
of  questionable  validity  to  pass  validation). 

A  short  numerical  example  follows  the  definition  of  each  validity 
criterion. 

Note:  As  previously  stated,  there  are  many  advantages  to  using  the 
general  class  of  non-parametric  statistical  methods  for  metrics 
validation.  However,  although  the  specific  methods  that  are  associated 
with  each  validity  criterion  are  appropriate,  they  are  not  necessarily 
the  only  methods  that  could  be  used. 

Associativity: The variation in the quality factor
explained by the variation in the metric, which is given by
the square of the linear correlation coefficient (R^2) between
the metric and the corresponding quality factor, must exceed
V (R^2 > V).

This  criterion  assesses  whether  there  is  a  sufficiently  strong 
linear  association  between  a  quality  factor  and  a  metric  to  warrant 
using  the  metric  as  a  substitute  for  the  quality  factor,  when  it  is 
infeasible to use the latter. This criterion supports the quality
assessment function. The multivariate statistical methods of linear
correlation and partial linear correlation analysis [15] can be used
for this test.


For example, the correlation coefficient between a complexity
metric and the quality factor reliability may be .9. The square of
this is .81. Thus 81% of the variation in the quality factor is
explained by the variation in the metric. If this relationship is
demonstrated over a representative sample of components, and if V has
been established as .7, one could conclude that the metric has the
ability to associate complexity with reliability and can be used to
compare magnitudes of complexity obtained from different components to
estimate the degree to which they differ in reliability.
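
The associativity test can be sketched as follows. This is a minimal
illustration, not a prescribed implementation: the function names and
the sample data (five components with complexity and MTTF values) are
hypothetical, and only the threshold V = .7 is taken from the example
above.

```python
def r_squared(metric, factor):
    """Square of the linear correlation coefficient between two sequences."""
    n = len(metric)
    mean_m = sum(metric) / n
    mean_f = sum(factor) / n
    cov = sum((m - mean_m) * (f - mean_f) for m, f in zip(metric, factor))
    var_m = sum((m - mean_m) ** 2 for m in metric)
    var_f = sum((f - mean_f) ** 2 for f in factor)
    return (cov * cov) / (var_m * var_f)

def passes_associativity(metric, factor, v):
    """True when R^2 exceeds the threshold V."""
    return r_squared(metric, factor) > v

# Hypothetical sample: complexity metric and MTTF (hours) per component.
complexity = [3, 5, 7, 9, 12]
mttf = [1500, 1300, 1000, 900, 600]

V = 0.7  # threshold used in the example in the text
r2 = r_squared(complexity, mttf)
associative = passes_associativity(complexity, mttf, V)
```

Squaring the coefficient discards its sign, so an inverse relationship
(high complexity, low MTTF) is treated the same as a direct one.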

Consistency: If a quality factor vector F1, F2, ..., Fn,
corresponding to components 1, 2, ..., n, has the relationship F1 > F2
> ... > Fn, the corresponding metric vector must have the relationship
M1 > M2 > ... > Mn.

This criterion assesses whether there is consistency between the
ranks of the quality factor and the ranks of the metric for the same
set of components. Thus this criterion is used to determine whether a
metric can accurately rank, by quality, a set of components. This
criterion supports the quality assessment function. The non-parametric
statistical method Spearman Rank Correlation [3,5,6,8] can be used for
this test.

For example, if the reliability of components A, B and C, as
measured by MTTF, is 1000, 1500 and 800 hours, respectively, and the
corresponding complexity metric values are 5, 3 and 7, where low
metric values are 'better' than high values, the ranks for reliability
and metric values, with '1' representing the 'highest' rank, are as
follows:

Component     Reliability Rank     Complexity Rank

B             1                    1

A             2                    2

C             3                    3

If this relationship is demonstrated over a representative sample
of components, one could conclude that the metric is consistent and
can be used to rank the quality of components.
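
The ranking in this example can be reproduced with a short sketch. The
function names are hypothetical, and the Spearman formula used here is
the standard form that assumes no tied values.

```python
def rank_best_first(values, higher_is_better):
    """Rank 1 is the best value. Assumes no tied values."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for untied ranks: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

components = ["A", "B", "C"]
mttf = [1000, 1500, 800]   # hours; higher is better
complexity = [5, 3, 7]     # lower is better

reliability_rank = rank_best_first(mttf, higher_is_better=True)
complexity_rank = rank_best_first(complexity, higher_is_better=False)
rho = spearman_rho(reliability_rank, complexity_rank)
```

Both rank lists come out as 2, 1, 3 for components A, B, C, so the
coefficient is 1, which corresponds to the perfect agreement shown in
the table.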

Discriminative Power: A metric must be able to discriminate between
high quality components (e.g., high MTTF) and low quality components
(e.g., low MTTF). For example, the set of metric values associated
with the former should be significantly higher (or lower) than those
associated with the latter.

This criterion assesses whether a metric is capable of separating a
set of high quality components from a set of low quality components.
This capability allows one to establish critical values for metrics
which can be used to identify components which may have unacceptable
quality. This criterion supports the quality control function. The
following non-parametric statistical methods can be used for this
validation test: Mann-Whitney Test [4,5,6,8], chi-square test for
differences in probabilities (contingency tables) [5,8] and the
Kruskal-Wallis Test [4,5,6,8]. The multivariate statistical method
discriminant analysis [1,15] can also be used.

For example, if all components with a complexity metric value of
>10  (critical  value)  have  a  MTTF  of  1000  hours  and  all  components  with 
a  complexity  metric  value  equal  to  or  less  than  10  have  a  MTTF  of  2000 
hours,  and  this  difference  is  sufficient  to  pass  the  statistical 
tests,  then  the  metric  separates  low  from  high  quality  components.  If 
the  ability  to  discriminate  is  demonstrated  over  a  representative 
sample  of  software  components,  one  could  conclude  that  the  metric  can 
discriminate  between  low  and  high  reliability  components. 
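
One of the listed methods, the chi-square test on a 2 x 2 contingency
table, can be sketched as follows. The eight sample components, the
MTTF threshold of 1500 hours separating low from high quality, and the
function names are hypothetical; the critical value 10 comes from the
example above, and 3.841 is the standard chi-square critical value for
one degree of freedom at the 5% level.

```python
def contingency_table(samples, metric_critical, factor_threshold):
    """Count components by (flagged by metric?, low quality?).

    samples is a list of (metric_value, factor_value) pairs.
    Returns (a, b, c, d): flagged & low, flagged & high,
    not flagged & low, not flagged & high.
    """
    a = b = c = d = 0
    for metric_value, factor_value in samples:
        flagged = metric_value > metric_critical
        low_quality = factor_value < factor_threshold
        if flagged and low_quality:
            a += 1
        elif flagged:
            b += 1
        elif low_quality:
            c += 1
        else:
            d += 1
    return a, b, c, d

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical (complexity, MTTF in hours) pairs.
samples = [(12, 1000), (15, 900), (11, 1100), (14, 950),
           (8, 2000), (6, 2100), (9, 1900), (7, 2200)]

table = contingency_table(samples, metric_critical=10, factor_threshold=1500)
chi2 = chi_square_2x2(*table)
CHI2_CRITICAL_5PCT_1DF = 3.841
discriminates = chi2 > CHI2_CRITICAL_5PCT_1DF
```

A sample this small is used only to keep the arithmetic visible; the
chi-square approximation is unreliable when cell counts are this low,
so a representative sample would be needed in practice.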

Tracking: If a metric M is directly related to a
quality factor F, for a given component, then a change in
a quality factor value from FT1 to FT2, at
times T1 and T2, must be accompanied by a change in metric
value from MT1 to MT2, which is in the same
direction (e.g., if F increases, M increases). If M is
inversely related to F, then a change in F must be
accompanied by a change in M in the opposite direction
(e.g., if F increases, M decreases).

This  criterion  assesses  whether  a  metric  is  capable  of  tracking 
changes  in  quality  over  the  life  of  a  component.  This  criterion 
supports  the  quality  control  function.  The  following  non-parametric 
statistical  methods  can  be  used  for  this  validation  test:  Spearman 
Rank Correlation and Wald-Wolfowitz Runs Test (test for randomness)
[5,8]. 

For example, if a complexity metric is claimed to be a measure of
reliability, then it is reasonable to expect a change in the
reliability of a component to be accompanied by an appropriate change
in metric value (e.g., if the component increases in reliability, the
metric value should also change in a direction that indicates the
component has improved). That is, if MTTF is used to measure
reliability and is equal to 1000 hours during testing (T1) and 1500
hours during operation (T2), a complexity metric whose value is 8 in
T1 and 6 in T2, where 6 is 'better' than 8 (i.e., complexity has
decreased), is said to track reliability for this component. If this
relationship is demonstrated over a representative sample of
components, one could conclude that the metric can track reliability
(i.e., indicate changes in component reliability) over the life of the
component.
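
The direction check at the heart of the tracking criterion can be
sketched as follows. This is an illustration only: the function names
are hypothetical, and it tests a single component's history rather
than applying the statistical tests named above.

```python
def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def tracks(factor_series, metric_series, inverse=False):
    """True when every change in the factor is matched by a change in the
    metric in the same direction (or the opposite direction if inverse)."""
    for i in range(1, len(factor_series)):
        factor_change = factor_series[i] - factor_series[i - 1]
        metric_change = metric_series[i] - metric_series[i - 1]
        if inverse:
            metric_change = -metric_change
        if sign(factor_change) != sign(metric_change):
            return False
    return True

# Values from the example: MTTF at T1 and T2, complexity at T1 and T2.
mttf = [1000, 1500]
complexity = [8, 6]

# Complexity is inversely related to reliability, so inverse=True.
tracking_ok = tracks(mttf, complexity, inverse=True)
```

Under this sketch a factor that stays the same must be matched by an
unchanged metric; whether that strictness is appropriate is a choice
left to the user of the criterion.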


Predictability: If a metric is used at time T1 to
predict a quality factor for a given component, it must
predict a related quality factor FpT2 with an
accuracy of:

    | FaT2 - FpT2 |
    ---------------  <  A
         FaT2

where FaT2 is the actual value and FpT2 is the predicted value of F
at time T2.

This criterion assesses whether a metric is capable of predicting a
quality factor value with the required accuracy. It is simply a
relative error calculation [2,6], that takes into consideration the
time of measurement. The multivariate statistical methods of linear
regression, multiple linear regression, and non-linear regression can
be used for this analysis.


For example, if a complexity metric is used during design to
predict the reliability of a component during operation (T2) to be
1200 hours MTTF (FpT2) and the actual MTTF that is measured
during operation is 1000 hours (FaT2), then the error in
prediction is 200 hours, or 20%. If the acceptable prediction error
(A) is 25%, prediction accuracy is acceptable. If the ability to
predict is demonstrated over a representative sample of components,
one could conclude that the metric can be used as a predictor of
reliability. For example, prediction could be used during design to
identify those components that need to be improved.
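
The relative error calculation can be sketched as follows. The function
names are hypothetical, and the predicted value is taken to be 1200
hours, which is an assumption consistent with the stated error of 200
hours against an actual MTTF of 1000 hours.

```python
def relative_error(actual, predicted):
    """Absolute prediction error as a fraction of the actual value."""
    return abs(actual - predicted) / actual

def passes_predictability(actual, predicted, a):
    """True when the relative error is below the acceptable error A."""
    return relative_error(actual, predicted) < a

actual_mttf = 1000      # FaT2, hours, measured during operation
predicted_mttf = 1200   # FpT2, hours, assumed predicted value

error = relative_error(actual_mttf, predicted_mttf)
acceptable = passes_predictability(actual_mttf, predicted_mttf, a=0.25)
```

Dividing by the actual value rather than the predicted value matches
the 20% figure in the example (200 / 1000).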

Repeatability: A metric must demonstrate the above associativity,
consistency, discriminative power, tracking, and predictability
properties for P percent of the applications of the metric.

This criterion is used to ensure that a metric has passed a
validity test over a sufficient number or percentage of applications
so that there will be confidence that the metric can perform its
intended function consistently.

For example, if the required 'success rate' (P) for validating a
complexity metric against the Predictability criterion has been
established as 80%, and there are 100 components, the metric must
predict the quality factor with the required accuracy for at least 80
of the components.
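
The success rate check can be sketched as follows. The function names
are hypothetical, and the list of outcomes stands for the pass or fail
result of one validity test on each component.

```python
def success_rate(outcomes):
    """Percentage of applications in which the validity test passed."""
    return 100.0 * sum(outcomes) / len(outcomes)

def passes_repeatability(outcomes, p_percent):
    """True when the success rate is at least P percent."""
    return success_rate(outcomes) >= p_percent

# Example from the text: 100 components, P = 80%.
outcomes = [True] * 80 + [False] * 20
repeatable = passes_repeatability(outcomes, p_percent=80)
```

With 79 successes out of 100 the same check fails, which is the
boundary implied by "at least 80 of the components".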

VALIDATION  PROCEDURE 

Metrics  validation  includes  the  following  steps: 

o Identify the Quality Factors Sample

Draw a random sample from the metrics database.

o Identify the Metrics Sample

Draw a random sample from the same domain (e.g., same software) of
the metrics database.

o Perform Goodness of Fit Tests

Perform goodness of fit tests on the quality factor and metrics
data to identify their distributions.

o Perform a Statistical Analysis

Perform a statistical analysis using the methods listed under
Validity Criteria.

o Re-validate Metrics

Metrics validation is a continuous process. It is important to
revalidate  a  metric  each  time  it  is  used.  As  the  software  engineering 
process  changes,  the  validity  of  metrics  changes.  A  validated  metric 
may  not  necessarily  be  valid  in  other  environments  or  future 
applications.  A  metric  that  has  been  invalidated  may  be  valid  in  other 
environments  or  future  applications. 

o  Validate  and  Apply  Metrics  in  Similar  Environments 

There have been great disparities in results reported in the
literature concerning 'relationships' between metrics and the
quantities they purport to measure. For example, correlation
coefficients  of  number  of  errors  with  Halstead  Effort  and  McCabe 
Complexity  differ  by  a  factor  of  almost  two  [11].  Differences  have 
also  been  reported  with  respect  to  specification  refinement  levels 
[10].  These  disparities  point  up  the  need  to  apply  metrics  under 
conditions  that  are  similar  to  those  used  to  validate  the  metrics. 

There  should  be  a  project  in  which  metrics  data  have  been  collected 
and  validated  prior  to  application  of  the  metrics.  This  project  should 
be  similar  to  the  one  in  which  the  metrics  are  applied,  with  respect 
to  application,  project  size,  software  engineering  environment,  design 
methodology,  and  programming  language.  In  other  words,  to  the  extent 
possible,  conduct  a  controlled  experiment  [6].  Validation  and 
application  of  metrics  should  be  performed  during  the  same  phases  on 
different projects. Example: if metric X is collected during the
design phase of project A and the saved values are later validated
with respect to quality factor Y, which is collected during the
operations phase of project A, the metric X should be used during the
design phase of project B to assess quality factor Y with respect to
the operations phase of project B.

EXAMPLE  OF  VALIDATING  METRICS 

The  following  example  is  provided  to  show  how  to  make  metric 
validation  tests.  No  inferences  should  be  drawn  from  this  example 
regarding  the  validity  of  these  metrics  for  other  applications.  These 
metrics  are  used  for  illustrative  purposes  only.  The  results  of  the 
validation  tests  could  be  different  for  other  applications.  The  data 
used in the validation tests were collected from actual software
projects.

Purpose  of  Metrics  Validation 

The purpose of this validation is to determine whether cyclomatic
number (complexity (C)) and size (number of source statements (S))
metrics, either singly or in combination, could be used to assess,
control and predict the quality factor reliability, as represented by
the quality factor error count (E).

Validity Criteria

Select values of V, B, A, and P. The values of V, B, A, and P used
in the example are .7, .7, 20%, and 80%, respectively.

VALIDATION PROCEDURE

Perform  the  following  validation  steps: 

Identify the Quality Factor Sample

Draw a random sample of procedures (i.e., components), which is
summarized in Table 1, from the metrics database, for the quality
factor reliability, which is represented by the quality factor error
count (Errors). The error counts are listed by project and procedure
in Appendix A.

Identify the Metrics Sample

Using the same procedures (i.e., components) in Table 1, identify
the metrics samples for cyclomatic number (complexity) and size
(statements). The metrics values are listed by project and procedure
in Appendix A.

Table 1

Project  Application               Procedures      Statements  Errors
                                   (with errors)
1        String Processing          11 ( 5)           136        10
2        Directed Graph Analysis    31 (12)           430        27
3        Directed Graph Analysis     1 ( 1)            13         1
4        Data Base Management       69 (13)          1021        26
Total                              112 (31)          1600        64

Number of procedures: 112 total, 31 with errors, 81 with no errors.
Number of source statements: 2007 total, 1600 included in metrics
analysis.
Language: Pascal on all projects.
Programmer: Single programmer. Same programmer on all projects.

Perform  Goodness  of  Fit  Tests 


The best fits obtained for the data are the following
distributions:

Errors:      Negative Binomial (error procedures)
Complexity:  Negative Binomial (all procedures)
Statements:  Exponential       (all procedures)

Thus,  this  result  discourages  the  use  of  statistical  methods  that 
depend  on  assumptions  of  normality  and  encourages  the  use  of  non- 
parametric  methods. 

Perform  a  Statistical  Analysis 

Perform the tests described under Validity Criteria. Significance
level and sample size are denoted by α and N, respectively; when it
is necessary to specify a critical level of α in hypothesis tests,
.05 is used.


Associativity 

1. Compute the sample linear correlation coefficient (R) for Errors
(E) and Complexity (C) and for Errors (E) and Statements (S) and
compare each R² with V = .7 [15].

Table 2
Sample Correlations (Error Procedures)
N = 31

            Complexity   Statements
Errors      .7834        .5880
α           .0000        .0005

Sample Correlations (All Procedures)
N = 112

            Complexity   Statements
Errors      .8010        .6596
α           .0000        .0000

RESULT: R² < V = .7. Fails minimum R² tests.
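
The comparison in Test 1 is simple arithmetic and can be sketched as follows. The sketch squares each correlation reported in Table 2 and compares it with V = .7; the names `TABLE_2`, `passes_associativity` and `results` are illustrative only and do not come from the original study.

```python
# Sample correlations (R) of Errors with each metric, as reported in Table 2.
TABLE_2 = {
    "error procedures": {"Complexity": 0.7834, "Statements": 0.5880},
    "all procedures": {"Complexity": 0.8010, "Statements": 0.6596},
}
V = 0.7  # threshold on R^2 selected for the example


def passes_associativity(r, threshold=V):
    """Return (R^2, passed) for a sample correlation r."""
    r_squared = r * r
    return r_squared, r_squared >= threshold


results = {
    (sample, metric): passes_associativity(r)
    for sample, metrics in TABLE_2.items()
    for metric, r in metrics.items()
}

for (sample, metric), (r_squared, passed) in results.items():
    print(f"{sample:17s} {metric:11s} R^2 = {r_squared:.4f}  passes: {passed}")
```

Even the largest correlation, .8010, gives an R² of about .64, which is why both metrics fail the test.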

2. Perform a null hypothesis test H0: ρ = 0 for E and C [15].

RESULT: Reject H0 with α = .0000 and N = 31.

3. Perform a null hypothesis test H0: ρ > √V = .836 for E and C,
since we want R² > V = .7 [15].

RESULT: Accept H0 with α = .01 and N = 31.

4. Compute the partial correlation coefficients for E, C, and S. These
coefficients give the strength of the linear relationship between two
variables while controlling for the effects of the remaining variables
[15]. This is a method for controlling for the effect of size (i.e.,
when the partial correlation coefficient between E and C is computed,
the effect of S is eliminated so that the association between E and C
alone can be observed).

Table 3
Sample Partial Correlations (Error Procedures)
N = 31

              Complexity   Statements
Errors        0.64298      -0.08157
Complexity                  0.65568

RESULT: From the low R for E and S, it can be seen that Statements
contributes essentially no additional information about Errors, once
Complexity has been correlated with Errors. Also, the R for E and C
indicates the correlation between Errors and Complexity with the
effect of size (S) eliminated.
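
A first-order partial correlation can be computed from the three pairwise correlations. The sketch below uses the standard formula with R(E, C) and R(E, S) from Table 2. The pairwise correlation between C and S for error procedures is not reported in the text, so `r_cs_assumed = 0.79` is a hypothetical value, chosen only because it yields partial correlations close to those in Table 3.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# From Table 2 (error procedures): R(E, C) and R(E, S).
r_ec = 0.7834
r_es = 0.5880
# ASSUMPTION: R(C, S) for error procedures is not given in the text;
# 0.79 is a hypothetical value used only to exercise the formula.
r_cs_assumed = 0.79

r_ec_given_s = partial_correlation(r_ec, r_es, r_cs_assumed)
r_es_given_c = partial_correlation(r_es, r_ec, r_cs_assumed)
print(f"E and C, controlling for S: {r_ec_given_s:.3f}")
print(f"E and S, controlling for C: {r_es_given_c:.3f}")
```

Under that assumption the E and S partial correlation is close to zero, in line with the statement that Statements adds little once Complexity is accounted for.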

5. Compute a confidence interval of ρ for E and C [15].

RESULT: .593 < ρ < .891 with α = .05 and N = 31.
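
One common way to obtain such an interval is Fisher's z transformation. The text does not state which method was used, so the sketch below is an assumption; with R = .7834 and N = 31 it gives limits close to the reported ones. The function name `fisher_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist


def fisher_confidence_interval(r, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for rho via Fisher's z."""
    z = math.atanh(r)                        # Fisher transformation of r
    standard_error = 1.0 / math.sqrt(n - 3)  # approximate standard error of z
    z_critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    lower = math.tanh(z - z_critical * standard_error)
    upper = math.tanh(z + z_critical * standard_error)
    return lower, upper


lower, upper = fisher_confidence_interval(r=0.7834, n=31)
print(f"{lower:.3f} < rho < {upper:.3f}")
```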

Tests 3, 4 and 5 provide additional useful information about linear
correlation but they are not part of the required validation
procedure.

6. Perform a Factor Analysis

Note: In this section a factor is defined as follows:

X_j = \lambda_{1j} F_1 + \lambda_{2j} F_2 + \cdots + \lambda_{kj} F_k + U_j

where X_j is a variable (metric), F_1, F_2, ..., F_k are factors that
are common to all the variables, U_j is a random factor unique to X_j,
and \lambda_{1j}, \lambda_{2j}, ..., \lambda_{kj} are factor loadings
(correlations between variables and factors) [9,15].

Do not confuse the use of the statistical term 'factor' with the
use of the metrics term 'quality factor'.

The objective of factor analysis is to reduce a set of metrics to a
smaller, orthogonal set of factors that can better explain the
relationships between correlated metrics. It frequently occurs that
several 'independent' variables (Complexity, Statements) that are used
to study the behavior of a dependent variable (Errors) are themselves
dependent and correlated -- the multicollinearity problem (see Table
3). Recent studies [14,17] have shown that a large number of metrics
[16] can be reduced to a small manageable set that represents the
underlying relationship between the quality factor and one or more
metrics. The method is most useful when there are many metrics. The
example that follows only involves three metrics. The mechanics of the
analysis are to attempt to identify one or more factors that contain
high loadings for a subset of the metrics in the factor, including the
quality factor, and low loadings for the remaining metrics. Then the
loadings are examined, excluding the quality factor, to see which
metrics of the candidate factors from the first step have high
loadings. These metrics would be emphasized in certain other analyses,
like regression analysis. The remaining metrics would be deemphasized
or discarded. An example is shown in Table 4, for procedures with
errors, where Factor 2 contains relatively high loadings for 'Errors'
and 'Complexity'. Table 5 shows a relatively high loading for
'Complexity'. This analysis indicates that a single metric --
Complexity -- suffices for explaining the variance in the Errors
metric. A similar result was obtained using all procedures.

Table 4
Factor Loadings (Error Procedures)
N = 31

Metric        Factor 1     Factor 2
Errors        0.30128      0.94013
Complexity    0.68442      0.66467
Statements    0.94057      0.29572

Table 5
Factor Loadings (Error Procedures)
N = 31

Metric        Factor 1     Factor 2
Complexity    0.44002      0.89799
Statements    0.89799      0.44002


CONCLUSION:  The  results  are  mixed.  Although  the  results  of  tests  2,  3 
and  5  are  favorable,  Complexity  failed  mandatory  Test  1.  Thus, 
evaluating  the  results  conservatively,  Complexity  is  judged  to  be 
invalid  with  respect  to  Associativity.  Statements  does  not  perform  as 
well  as  Complexity  and  is  invalid  with  respect  to  Associativity. 
Furthermore  the  factor  analysis  indicates  that  only  one  of  the  metrics 
--  Complexity  --  is  needed. 

Consistency 

1. Compute the Spearman Coefficient of Rank Correlation (r) for E
and C over all procedures with errors. Correlation is lower for E
and S than for E and C and is not shown. Compare r with B = .7 and
α with .05 [5,8].

Table 6
Spearman Rank Correlation (Error Procedures)
N = 31

            Complexity   Remarks
Errors      .5119        r < .7
α           .0051        α < .05

RESULT: The desired result is r > .7 and α < .05. Complexity does
not change consistently with changes in Errors across all procedures
with errors. Therefore Complexity is not valid with respect to
Consistency. Also, Statements is not valid with respect to
Consistency.
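
The rank correlation used for this criterion can be sketched without a statistics package. The version below assigns average ranks to tied values and then takes the Pearson correlation of the ranks, which is one standard way to handle ties. The six (complexity, errors) pairs are hypothetical and are not taken from Appendix A; all function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for six procedures; not the Appendix A data.
complexity = [1, 2, 2, 4, 6, 9]
errors = [1, 1, 3, 2, 5, 8]
B = 0.7  # consistency threshold selected for the example
r = spearman(complexity, errors)
print(f"r = {r:.4f}, meets B: {r >= B}")
```

The significance level of r would still have to come from tables or a statistical program; the sketch covers only the coefficient itself.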
Discriminative Power

1. Divide the data into two sets: procedures with errors and
procedures with no errors. Rank these sets according to their C and S
values (statistical programs will do the ranking automatically) and
perform the Mann-Whitney test to see whether C and S can discriminate
between the two sets of procedures (i.e., tell the difference between
high quality and low quality software) [5,8].

RESULT: The results of the Mann-Whitney test for C and S are shown in
Table 7. The average ranks of C and S for procedures with errors are
much higher than the average ranks for procedures with no errors,
respectively. We can infer from the low probabilities of higher
statistics that C and S for procedures with errors have significantly
higher medians in the populations (i.e., that C and S could
discriminate a priori between high quality and low quality software).
Caution: a large number of ties weakens this test. There are a large
number of ties in C but not in S [5,8].

Table 7
Mann-Whitney Test: Comparison of Two Samples

Sample 1: Complexity - Procedures with errors
Sample 2: Complexity - Procedures with no errors
Average rank of first group = 85.9032 based on 31 values.
Average rank of second group = 45.2469 based on 81 values.
Large sample test statistic Z = -6.30181
Two-tailed probability of equaling or exceeding Z = 2.95465E-10
112 total observations.

Sample 1: Statements - Procedures with errors
Sample 2: Statements - Procedures with no errors
Average rank of first group = 85.2419 based on 31 values.
Average rank of second group = 45.5 based on 81 values.
Large sample test statistic Z = -5.82408
Two-tailed probability of equaling or exceeding Z = 5.76106E-9
112 total observations.
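
The large-sample statistic can be approximated from the average ranks alone. The sketch below uses the usual normal approximation for the Mann-Whitney U statistic with no tie correction and no continuity correction, which is an assumption about the method. It gives magnitudes of roughly 5.9 for Complexity and 5.8 for Statements. These are somewhat smaller than the reported values, presumably because the statistical program corrects for ties, which are numerous in C. The sign of Z depends on which group is taken first.

```python
import math


def z_from_mean_rank(n1, mean_rank1, n2):
    """Large-sample Mann-Whitney Z from the first group's average rank.

    No tie correction and no continuity correction are applied.
    """
    rank_sum1 = n1 * mean_rank1
    u1 = rank_sum1 - n1 * (n1 + 1) / 2.0   # U statistic for group 1
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u1 - mean_u) / sd_u


def two_tailed_p(z):
    """Two-tailed normal probability of a value at least as extreme as z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Average ranks of the 'errors' group reported in Table 7 (n1 = 31, n2 = 81).
z_complexity = z_from_mean_rank(31, 85.9032, 81)
z_statements = z_from_mean_rank(31, 85.2419, 81)
for name, z in (("Complexity", z_complexity), ("Statements", z_statements)):
    print(f"{name}: |Z| = {abs(z):.3f}, two-tailed p = {two_tailed_p(z):.3e}")
```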

2a. Divide the data into four categories, as shown in Table 8,
according to a critical value of C, Cc, so that a Chi-square test
can be performed to determine whether Cc can discriminate between
procedures with errors and those with no errors. Cc is chosen to
provide at least five observations for each cell in Table 8 in order
to ensure the validity of the test. This may involve trial and error
)]. 
Table 8
Contingency Table

              Complexity   Complexity
              ≤ 3          > 3          Total
No Errors     75            6            81
Errors        10           21            31
Total         85           27           112



RESULT:  The  result  of  the  Chi-square  test  is  shown  in  Table  9.  From 
the  high  value  of  chi-square  and  the  very  small  significance  level 
in  the  samples,  we  infer  that  Cc  could  discriminate  between 
procedures  with  errors  (low  quality  software)  and  those  without 
errors   (high   quality   software). 

Table 9
Summary Statistics for Contingency Tables: Cc = 3

Chi-square    D.F.    Significance
44.6081       1       2.40692E-11


Sensitivity Analysis of Critical Value of Complexity

In order to see how good a discriminator Cc is for this
example, we observe the number of misclassifications that result for
various values of Cc: 1) Type 1 ('error procedures' classified as
'no error procedures') and 2) Type 2 ('no error procedures'
classified as 'error procedures'). This is shown in Figure 4. As
Cc increases, Type 1 misclassifications increase because an
increasing number of high complexity procedures, many of which have
errors, are classified as having 'no errors'. Conversely, as Cc
decreases, Type 2 misclassifications increase because an increasing
number of low complexity procedures, many of which have no errors,
are classified as having 'errors'. The total of the two curves
represents the 'misclassification' function. It has a minimum at
Cc = 3, which is the value given by the Chi-square test (the
Chi-square test will not always produce the optimum Cc but the
value should be close to optimum).

The foregoing analysis assumes that the costs of Type 1 and Type 2
misclassifications are equal. This is usually not the case since
the consequences of not finding an error (i.e., concluding that
there is no error when, in fact, there is an error) would be higher
than the other case (i.e., concluding that there is an error when,
in fact, there is no error). In order to account for this situation,
the number of Type 1 misclassifications, for given values of Cc,
is multiplied by C1/C2 (C1/C2 = 1, 2, 3, 4, 5), which is the ratio
of the cost of Type 1 misclassification to the cost of Type 2
misclassification. These values are added to the number of Type 2
misclassifications to produce the family of five 'cost' curves shown
in Figure 5. Naturally, with the higher cost of Type 1
misclassifications taking effect, the optimum Cc (i.e., minimum
cost) decreases. However, even at C1/C2 = 5, Cc = 3 is a reasonable
choice.
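
The misclassification counts and cost curves described above can be sketched as follows. A component is classified as 'no error' when its metric value does not exceed the critical value, and the Type 1 count is weighted by the cost ratio C1/C2. The `sample` data are hypothetical (not the Appendix A data) and the function names are illustrative. For reference, the counts in Table 8 imply 10 Type 1 and 6 Type 2 misclassifications at Cc = 3.

```python
def misclassifications(components, critical_value):
    """Count (type_1, type_2) misclassifications for a critical value.

    components: iterable of (metric_value, error_count) pairs.
    A component is classified 'no error' when metric_value <= critical_value.
    Type 1: has errors but is classified 'no error'.
    Type 2: has no errors but is classified 'error'.
    """
    type_1 = sum(1 for m, e in components if e > 0 and m <= critical_value)
    type_2 = sum(1 for m, e in components if e == 0 and m > critical_value)
    return type_1, type_2


def cost(components, critical_value, cost_ratio=1):
    """Type 1 count weighted by C1/C2, plus the Type 2 count."""
    type_1, type_2 = misclassifications(components, critical_value)
    return cost_ratio * type_1 + type_2


def best_critical_value(components, candidates, cost_ratio=1):
    """Candidate with the lowest cost; ties go to the smaller value."""
    return min(candidates, key=lambda c: (cost(components, c, cost_ratio), c))


# Hypothetical (complexity, errors) pairs; not the Appendix A data.
sample = [(1, 0), (1, 0), (2, 0), (2, 1), (3, 0),
          (3, 0), (4, 2), (5, 0), (6, 3), (8, 1)]
for ratio in (1, 2, 3, 4, 5):
    best = best_critical_value(sample, range(1, 9), ratio)
    print(f"C1/C2 = {ratio}: critical value {best}, "
          f"cost {cost(sample, best, ratio)}")
```

In this toy data the preferred critical value drops as the cost ratio grows, the same qualitative behavior as described for the cost curves.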
2b. Do Step 2a. for S. The Contingency Table is shown in Table 10.

Table 10
Contingency Table

              Statements   Statements
              ≤ 13         > 13         Total
No Errors     64           17            81
Errors         7           24            31
Total         71           41           112

Table 11
Summary Statistics for Contingency Tables: Sc = 13

Chi-square    D.F.    Significance
30.7658       1       2.91118E-8

RESULT: The same comments made in Step 2a. apply to Sc.

Sensitivity Analysis of Critical Value of Size

The same type of analysis is performed on Sc as was performed
on Cc to see how good Sc is as a discriminator of quality. The
curves of Type 1, Type 2 and total misclassifications are shown in
Figure 6, where it is seen that the optimum Sc = 15, as opposed to
Sc = 13, as given by the Chi-square analysis. The 'cost' curves are
shown in Figure 7, where again the optimal Sc decreases as C1/C2
increases. Considering the family of cost curves, Sc = 13 is a
reasonable value but Sc does not perform as well as Cc in this
example, because, whereas Sc = 15 is not optimum for any of the
cost curves, Cc = 3 is optimum for three of the five curves. This
result could be anticipated by the higher Chi-square and lower value
of significance (better ability to distinguish between high and low
quality) obtained for C in Table 9 as compared to the corresponding
values obtained for S in Table 11.

3. Perform the Kruskal-Wallis test (not shown) to ascertain whether C
and S are good discriminators with respect to given values of E (i.e.,
higher ranks of C and S for higher values of E).

RESULT:  C  and  S  were  good  discriminators  when  both  procedures  with 
errors   and   all   procedures   were   evaluated. 

Discriminant   Analysis 

Another  approach  to  estimating  and  using  a  critical  value  of  a 
metric  is  to  use  discriminant  analysis  [1,15].  We  briefly  describe 
this  method  more  to  indicate  its  general  potential  than  as  a  method 
that  can  be  applied  in  this  example  because,  unfortunately, 
discriminant  analysis  is  based  on  the  assumption  that  the  random 
variables  are  normally  distributed  [1,15].  This  is  not  the  case  for  E, 
C   and   S,    as   was   observed    from   the   goodness   of    fit    tests. 

In this technique, a linear function of random variables, called
the discriminant function, is found such that, when this function is
evaluated, its value can be used to classify the random variables into
one of N groups. For example, a linear function of C and S:

L = b_c C + b_s S

can be used to classify the tuple (C,S) into the 'error' group or
'no error' group depending on whether L ≥ or < Lc, the cutoff
value of the discriminant function. The coefficients of L are
determined by maximizing the ratio of the variance between the two
groups to the variance within groups, thus providing for maximum
discrimination. Using Ce and Se of the 'error' group, a
cutoff value Le is determined for the 'error' group.
Similarly, using Cne and Sne of the 'no error' group, a cutoff value
Lne is determined for the 'no error' group. The two values of L are
combined to form a single cutoff value Lc. A reasonable way to do
this is to weight Le by the probability of a component being in the
'error' group (i.e., the fraction of components in the 'error' group)
and to weight Lne by the probability of a component being in the 'no
error' group (i.e., the fraction of components in the 'no error'
group).

As was the case with factor analysis, it was found that using
both C and S was no better than using C or S alone in the
discriminant function. If a single variable is used, a very
interesting and useful result is obtained. In this case, the
coefficients in L become b = 1; then L = C, L = S. Using this result
in Lc produces two cutoff values, the mean of C and the mean of S.
Thus the mean values of C = 2.53 or 3 (same value as obtained
with Chi-square) and S = 14.29 or 14 (value obtained with Chi-square
= 13) could be used for Cc and Sc, respectively. The great
advantage of this approach over the Chi-square technique is that
the means of C and S can be used directly, thus obviating the need for
trial and error calculations with Chi-square.
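
The weighting scheme for the single-variable case can be illustrated with a short sketch. It assumes, as one reading of the text, that each group's cutoff is that group's mean; weighting the two group means by the group fractions then reproduces the overall mean. The two small groups are hypothetical data, while the last lines use only the totals from Table 1 (1600 statements in 112 procedures).

```python
def weighted_cutoff(error_values, no_error_values):
    """Combine the two group cutoffs (taken here as the group means),
    weighting each by the fraction of components in its group."""
    n_error, n_no_error = len(error_values), len(no_error_values)
    total = n_error + n_no_error
    cutoff_error = sum(error_values) / n_error
    cutoff_no_error = sum(no_error_values) / n_no_error
    return ((n_error / total) * cutoff_error
            + (n_no_error / total) * cutoff_no_error)


# Hypothetical statement counts for two small groups; not the Appendix A data.
error_group = [20, 35, 14, 27]
no_error_group = [3, 6, 8, 11, 4, 9]

cutoff = weighted_cutoff(error_group, no_error_group)
overall_mean = sum(error_group + no_error_group) / (
    len(error_group) + len(no_error_group))
print(cutoff, overall_mean)

# With the totals in Table 1, the overall mean of S is 1600 / 112.
mean_statements = 1600 / 112
print(f"{mean_statements:.2f}")
```

The second computation agrees with the value of 14.29 quoted for S, which suggests that the cutoff in the single-variable case is the mean over all procedures.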

CONCLUSION: C and S are valid with respect to the Discriminative
Power criterion and either could be used to distinguish between
acceptable (C ≤ 3, S ≤ 13) and unacceptable quality (C > 3, S >
13) for this and similar applications when this data can be
collected. However, only one is needed (i.e., C is highly correlated
with S and the correlation between E and C/S (normalized) is close
to 0). It should be noted that it is less expensive to collect S
than C.

Tracking

Ideally we want to track a metric against a quality factor over
time for a single component (e.g., procedure). Unfortunately this type
of data is not always readily available because a time history of
corresponding quality factor and metric changes is required. This data
was not available in this example. In lieu of this data, the Spearman
Coefficient of Rank Correlation (r) can be used as a measure of the
ordering of the metric in relation to the quality factor, with project
being the 'component' (see below). Note, however, that (r) does not
have a chronological ordering. Also, while (r = 1) implies perfect
tracking, as defined previously, the converse is not true.

1. Compute the Spearman Coefficient of Rank Correlation (r) for E
and C for Projects 1, 2, and 4 separately (Project 3 is not used
because it has only one error). Correlation is lower for E and S
than for E and C and is not shown. Compare (r) with B = .7 and α
with .05. Procedures with errors are used rather than all procedures
because the latter has too many ties in the sample. Rank correlation
should not be used when there is a large number of ties. A moderate
number of ties is tolerable [5,8].

Table 12
Spearman Rank Correlations (Error Procedures)

Project 1, N = 5 (small sample size)
            Complexity   Remarks
Errors      .8250        r > .7
α           .0990        α > .05

Project 2, N = 12
Errors      .6723        r < .7
α           .0258        α < .05

Project 4, N = 13
Errors      .2522        r < .7
α           .3824        α > .05

RESULT: The desired result is r > .7 and α < .05 (i.e., indication
of non-zero correlation) for each project. Complexity does not track
changes in Errors sufficiently for any of the projects. Therefore,
Complexity is not valid with respect to Tracking. Also, Statements
is not valid with respect to Tracking.

2.  Subsequent  to  calculating  (r),  we  were  able  to  observe 
chronologically  the  procedures  which  comprise  a  project,  so  that  for 
this  example  the  project  was  the  'component'  and  the  procedures  that 
comprise  the  project  were  'tracked'.  A  runs  test  was  conducted  for 
Projects  1  and  2  by  assigning  a  '1'  if  M  changed  in  the  same  direction 
as  F  (i.e.  tracks)  and  a  '0'  if  this  was  not  the  case  (does  not 
track).  The  runs  test  determines  whether  the  binary  sequences  (runs) 
are  systematic  (i.e.,  M  tracks  F)  or  would  be  expected  by  chance. 

RESULT:  Projects  1  and  2  failed  (did  not  track)  the  runs  test. 
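
The two steps of this test, building the binary tracking sequence and testing its runs, can be sketched as follows. The sketch uses the Wald-Wolfowitz runs test with a normal approximation, which is an assumption since the text does not name the variant used. The chronological series `metric` and `factor` are hypothetical, and the function names are illustrative.

```python
import math


def tracking_sequence(metric, factor):
    """1 where metric and factor change in the same direction, else 0.

    Two unchanged values (both differences zero) also count as tracking.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = zip(zip(metric, metric[1:]), zip(factor, factor[1:]))
    return [1 if sign(m1 - m0) == sign(f1 - f0) else 0
            for (m0, m1), (f0, f1) in pairs]


def runs_test_z(sequence):
    """Wald-Wolfowitz runs test Z for a binary sequence (normal approximation)."""
    n1 = sum(sequence)
    n0 = len(sequence) - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("sequence must contain both 0s and 1s")
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    n = n1 + n0
    expected = 2.0 * n1 * n0 / n + 1.0
    variance = 2.0 * n1 * n0 * (2.0 * n1 * n0 - n) / (n * n * (n - 1.0))
    return (runs - expected) / math.sqrt(variance)


# Hypothetical chronological values of a metric M and a quality factor F.
metric = [2, 3, 5, 4, 4, 6, 7, 5, 8, 9, 7]
factor = [1, 2, 3, 3, 2, 1, 2, 3, 4, 3, 2]
sequence = tracking_sequence(metric, factor)
print(sequence, round(runs_test_z(sequence), 4))
```

A Z value near zero indicates a number of runs consistent with chance. For short sequences such as those of Projects 1 and 2, exact tables would be preferable to the normal approximation.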

Predictability 

1.  Make  a  scatter  plot  of  E  and  C  for  procedures  with  errors  to  obtain 
a  rough  analysis  of  linearity  [15]. 

RESULT: The dots on Figure 8 show the relationship.

Figure 8  Complexity Metric

2. Perform a linear regression analysis of E on C for procedures with
errors.

a. Test whether the assumptions of linear regression analysis hold
for these data. Two of the important assumptions are: (1) E is
normally distributed for given values of C and (2) the variances of E
are equal for given values of C [15].

RESULT: For cases of C = 1 and 2, where there was an adequate sample
size, tests were conducted and it was found that neither assumption
holds. In addition, E was not normally distributed when all 112
procedures were used in the analysis. The best fit for E is a negative
binomial distribution.

b. Examine the residuals of E (difference between observed and
predicted) as a function of C [15].

RESULT: Residuals increase with increasing C. This indicates that
prediction error increases with increasing C. This is an undesirable
result since we want prediction error to be independent of C.

c. The same results were obtained in a. and b. when all procedures
were used.

3. Plot the regression model in Figure 8 for E on C for procedures
with errors. The equation is: E = .151 + .404C. The inner band is the
95% confidence interval for average E (i.e., 95% chance that, for a
given C, the estimate of average E will fall within the band) and the
outer band is the 95% prediction interval of E (i.e., 95% chance that,
for a given C, the estimate of E will fall within the band) [15]. The
fit is worse for regression of E on S (not shown).

4. Compare Observed Errors with Predicted Errors (obtained from
regression model) and note whether Predictability < A = 20%, for P =
80% of the predictions.

RESULT: Table 13 indicates that Predictability < 20% in only 11 out of
31 cases, or 35%; the result is 16% when all procedures are used (not
shown). Fails Predictability and Repeatability tests.

Table 13

Observed   Predicted    Prediction   Predictability   Project
Errors     Errors       Error        (%)
1          0.957831      0.04217       4.21686        1
5          2.572289      2.42771      48.55421        1
2          2.168674     -0.16868       8.43373        1
1          2.168674     -1.16867     116.86747        1
1          0.957831      0.04217       4.21686        1
1          0.554216      0.44578      44.57831        2
1          0.554216      0.44578      44.57831        2
1          0.554216      0.44578      44.57831        2
3          0.957831      2.04217      68.07228        2
3          3.379518     -0.37952      12.65060        2
1          1.765060     -0.76506      76.50602        2
3          2.572289      0.42771      14.25702        2
2          0.957831      1.04217      52.10843        2
1          1.765060     -0.76506      76.50602        2
2          2.168674     -0.16868       8.43373        2
1          1.765060     -0.76506      76.50602        2
8          6.608433      1.39157      17.39457        2
1          0.957831      0.04217       4.21686        3
1          2.572289     -1.57229     157.22891        4
1          2.168674     -1.16867     116.86747        4
5          3.379518      1.62048      32.40963        4
2          1.361445      0.63855      31.92771        4
1          1.361445     -0.36145      36.14457        4
1          2.975903     -1.97590     197.59036        4
1          2.168674     -1.16867     116.86747        4
3          1.765060      1.23494      41.16465        4
2          2.168674     -0.16868       8.43373        4
5          5.397590     -0.39759       7.95180        4
1          1.765060     -0.76506      76.50602        4
1          1.765060     -0.76506      76.50602        4
2          1.765060      0.23494      11.74698        4
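
The Predictability check can be reproduced from the Observed and Predicted columns of Table 13. The sketch below computes the relative prediction error for each procedure, counts how many fall under A = 20%, and compares the share with P = 80%. The names `TABLE_13`, `predictability`, and `share_within` are illustrative.

```python
# (observed errors, predicted errors) for the 31 error procedures in Table 13.
TABLE_13 = [
    (1, 0.957831), (5, 2.572289), (2, 2.168674), (1, 2.168674), (1, 0.957831),
    (1, 0.554216), (1, 0.554216), (1, 0.554216), (3, 0.957831), (3, 3.379518),
    (1, 1.765060), (3, 2.572289), (2, 0.957831), (1, 1.765060), (2, 2.168674),
    (1, 1.765060), (8, 6.608433), (1, 0.957831), (1, 2.572289), (1, 2.168674),
    (5, 3.379518), (2, 1.361445), (1, 1.361445), (1, 2.975903), (1, 2.168674),
    (3, 1.765060), (2, 2.168674), (5, 5.397590), (1, 1.765060), (1, 1.765060),
    (2, 1.765060),
]
A = 20.0  # prediction error tolerance, in percent
P = 80.0  # required percentage of predictions within the tolerance


def predictability(observed, predicted):
    """Absolute prediction error as a percentage of the observed value."""
    return abs(observed - predicted) / observed * 100.0


count_within = sum(1 for o, p in TABLE_13 if predictability(o, p) < A)
share_within = 100.0 * count_within / len(TABLE_13)
meets_criterion = share_within >= P
print(f"{count_within} of {len(TABLE_13)} within {A:.0f}% "
      f"({share_within:.0f}%); meets P = {P:.0f}%: {meets_criterion}")
```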

5.  Try  non-linear  single  independent  variable  regression  models. 

RESULT: Several non-linear (e.g., exponential) regressions of E on C
for procedures with errors had lower correlation and worse fit (not
shown) than the linear model.

6.  Perform  multiple  linear  regression  analysis,  using  E  as  dependent 
variable  and  C  and  S  as  'independent  variables'. 

a.  Test  whether  the  assumptions  of  the  multiple  regression  model  hold. 
An  important  assumption  of  this  method  is  that  the  'independent 
variables'  are  actually  independent  [15]. 

RESULT: The significant R between C and S of .833 for all procedures
indicates dependence.

b. Examine the residuals of E for all procedures [15].

RESULT: Residuals increase with increasing C and S, indicating that
prediction error would increase with increasing C and S -- an
undesirable result.

c. Plot the multiple regression model and compare with results of Step
3 [15].

RESULT: The plots were made but are not shown because the fit is worse
than in Step 3. For procedures with errors the regression equation is:
E = .174 + .437C - .00672S. Statements contributes little to the
relationship. The comparison between simple and multiple regression is
summarized in Table 14, where F-Ratio is a measure of goodness of fit
(generally, high value signifies good fit) and P is the percentage of
predictions that are within the prediction error tolerance (A = 20%).

Table 14

               E vs. C      E vs. C      E vs. C, S   E vs. C, S
               Error        All          Error        All
               Procedures   Procedures   Procedures   Procedures
R              .783         .801         .785         .801
F-Ratio        46.1         196.9        22.5         97.6
P for A < 20%  35%          16%          35%          22%

CONCLUSION:  Neither  C  nor  S  meets  the  Predictability  criterion,  either 
singly  or  in  combination,  for  predicting  E.  Multiple  regression  has  no 
advantage  over  single  variable  regression  for  these  data.  Also,  the 
assumptions  of  both  models  are  not  satisfied.  Therefore,  both  C  and  S 
are  not  valid  with  respect  to  Predictability. 

Re-validate  Metrics 

Repeat all validation tests for C and S on future projects, keeping
track of P, the Repeatability requirement (i.e., percentage of
applications a metric must pass validity tests to be certified as
valid).

Validate  and  Apply  Metrics  in  Similar  Environments 

The  final  result  of  the  validation  exercise  is  that  C  and  S  are 
valid  only  with  respect  to  the  discriminative  power  criterion  to 
support  the  quality  control  function.  To  the  extent  practical,  apply  C 
and  S  in  applications  and  environments  on  future  projects  that  are 
similar to this one.

SUMMARY AND FUTURE RESEARCH

We  described  a  comprehensive  metrics  validation  methodology  that 
has  six  validation  criteria,  each  of  which  supports  certain  quality 
functions.  New  criteria  were  defined  and  illustrated,  including 
consistency,  discriminative  power,  tracking  and  repeatability.  It  was 
shown  that  non-parametric  statistical  methods  play  an  important  role 
in  evaluating  whether  metrics  satisfy  the  validity  criteria.  A 
detailed  example  of  the  application  of  the  methodology  was  presented. 
Although  it  was  not  an  objective  of  our  research,  we  found  in  the 
example  that  a  single  metric  was  sufficient  to  measure  quality. 

Future  research  is  needed  to  extend  and  improve  the  methodology  by 
finding  answers  to  the  following  questions: 

o  To  what  extent  are  metrics  that  have  been  validated  on  one  project, 
using  our  criteria,  valid  measures  of  quality  on  future  projects  -- 
both similar and different projects?

o  Can optimum values of 'V', 'B', 'A', and 'P' be determined by
balancing the 'cost' of setting the threshold of validity too high
versus the 'cost' of setting it too low in order to reduce
subjectivity in selecting these values?

o  Can  optimum  critical  values  of  metrics  be  found  for  the 
discriminative  power  criterion  by  using  the  'costs'  of 
misclassification  in  order  to  eliminate  the  calculation  of  these
values  by  trial  and  error? 
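To make the last question concrete, the sketch below shows one way a
critical value could be chosen from misclassification costs instead of
by trial and error. It is an illustration of the question, not a method
from this paper. We assume that a procedure is flagged when C exceeds
the critical value, that a "false alarm" is a flagged procedure with no
errors, and that a "miss" is an unflagged procedure with errors. The two
cost weights are hypothetical; the C values are a small subset of
Appendix A.

```python
def misclassification_cost(critical_value, c_no_error, c_error,
                           cost_false_alarm, cost_miss):
    """Total cost of flagging procedures with C > critical_value."""
    false_alarms = sum(1 for c in c_no_error if c > critical_value)
    misses = sum(1 for c in c_error if c <= critical_value)
    return cost_false_alarm * false_alarms + cost_miss * misses


def best_critical_value(c_no_error, c_error, cost_false_alarm, cost_miss):
    """Return (critical value, cost) with the lowest total cost; ties go
    to the smallest critical value."""
    candidates = range(0, max(c_no_error + c_error) + 1)
    scored = [
        (cc, misclassification_cost(cc, c_no_error, c_error,
                                    cost_false_alarm, cost_miss))
        for cc in candidates
    ]
    return min(scored, key=lambda pair: pair[1])


# Complexity values for a few procedures from Appendix A.
c_no_error = [2, 1, 1, 3, 5]   # procedures with E = 0
c_error = [2, 5, 8, 16, 3]     # procedures with E > 0

# Hypothetical costs: a miss is taken to be three times as costly
# as a false alarm.
print(best_critical_value(c_no_error, c_error, cost_false_alarm=1, cost_miss=3))
```

Changing the ratio of the two costs moves the selected critical value,
which is the trade-off the question asks about.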
APPENDIX  A.

C:  Complexity,  S:  Number  of  Source  Statements  (excluding  comments) 
E:  Error  Count 

Procedures with No Errors

 C    S    E   Project       C    S    E   Project

 2    6    0         1       1    3    0         4
 1    8    0         1       1    3    0         4
 1   11    0         1       1    3    0         4
 1    4    0         1       1    5    0         4
 3   18    0         1       1    5    0         4
 3   15    0         1       1    6    0         4
 1    3    0         2       1    9    0         4
 1    3    0         2       1    6    0         4
 1    3    0         2       1    8    0         4
 1    3    0         2       1    9    0         4
 1    3    0         2       1    9    0         4
 1    3    0         2       2    4    0         4
 1    3    0         2       2    7    0         4
 1    3    0         2       2    9    0         4
 1    5    0         2       4   56    0         4
 1    5    0         2       1   24    0         4
 1    5    0         2       2   13    0         4
 1   13    0         2       2   13    0         4
 1    3    0         2       2   10    0         4
 1    3    0         2       2    9    0         4
 1    3    0         2       2   12    0         4
 1    3    0         2       5   21    0         4
 1    3    0         2       5   49    0         4
 1    3    0         2       3   19    0         4
 1    3    0         2       4   20    0         4
 1    2    0         4       2    6    0         4
 1    2    0         4       2   12    0         4
 1    7    0         4       2    9    0         4
 1    5    0         4       2   10    0         4
 1    7    0         4       1   21    0         4
 1    5    0         4       4   21    0         4
 1    5    0         4       3   11    0         4
 1    5    0         4       2   13    0         4
 1    5    0         4       3   14    0         4
 1    4    0         4       7   19    0         4
 1    3    0         4       2   15    0         4
 1    3    0         4       2   10    0         4
 1    3    0         4       2   17    0         4
 1    3    0         4       3   19    0         4
 1    3    0         4       3   15    0         4
 2   15    0         4

Procedures with Errors

 C    S    E   Project       C    S    E   Project

 2   14    1         1       4   26              2
     26    5         1      16   94    8         2
 5    7    2         1       2   13    1         3
 5   21    1         1       6   83    1         4
 2    6    1         1       5   28    1         4
 1    3    1         2       8   37    5         4
 1   11    1         2       3   13    2         4
 1    8    1         2       3   16    1         4
 2   15    3         2       7   34    1         4
 8   45    3         2       5   24    1         4
 4   18    1         2       4   18    3         4
 6   54    3         2       5   35    2         4
 2   34    2         2      13   49    5         4
 4   19    1         2       4   19    1         4
 5   30    2         2       4   27    1         4
 4   17    2         4